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We present a statistical approach to protein structure by introducing a representation of protein 
folds based on simple observables defined as frequencies of oriented cycles in contact graphs. Moti- 
vated by the idea that these cycles may form a code of the entire protein structure we investigate 
its Shannon entropy. The latter surprisingly shows a characteristic transitory behavior when traced 
over different values of the geometric threshold t which defines protein residues in contact. To ac- 
count for this observation, we propose a non-linear mechanical model — formulated as a Hamiltonian 
system in which t is regarded as a continuous parameter — that allows to identify the behavior of 
the Shannon entropy as a first-order phase transition between a disordered and an ordered phase 
of the proposed mechanical system. The transition itself reflects the formation of protein structure 
when represented in terms of contact graph cycles, and it is identified as an example of a fragmenta- 
tion transition known from several other statistical systems without a proper thermodynamic limit. 
Some interesting implications follow from our model including chirality, broken t-symmetry and 
one- dimensionality. Although defined from purely structural considerations, we further show that 
for native protein structures these observables follow some of the quantitative rules that are typi- 
cally valid for amino acid types in protein sequences. Moreover, this relation between polypeptide 
structures and their amino acid sequences suggests a specific protein design alphabet. 
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I. INTRODUCTION 

How should we represent a protein? Depending on the 
given scientific context there are several valid represen- 
tations to choose from. For example, a protein may be 
viewed as a single point in phylogenetic tree spaces; or 
as a set of many (quantum) point particles interacting 
through Coulomb forces in 3D space; another legitimate 
representation is a finite sequence from the amino acid al- 
phabet; and we may also describe a protein as a so-called 
contact graph, i.e. a graph defined through contacts be- 
tween its amino acid residues. Indeed, the latter repre- 
sentation has recently provided insights into the structure 
of protein folds. For example, the reconstruction of the 
protein's native structure from the knowledge of the con- 
tact map's principal eigenvector or the idea that those 
graphs are small-world networks in which residues of high 
connectivity ("hubs") play a significant role in the fold- 
ing process These and related studies indicate that 
contact graphs are a phenomenologically rich way of rep- 
resenting proteins. 

In the present work we introduce a representation of 
proteins derived from simple topological properties of con- 
tact graphs to demonstrate several aspects of protein 
folds. This representation is given through a family of ob- 
servables which are the relative frequencies qu of oriented 
cycles of different lengths k present in contact graphs. 
Every contact graph is composed from cycles of various 
lengths k, and so every q^ simply is the frequency of or- 
dered contact pairs of residues that are (k — 1) sequence 
steps apart. The frequency values directly depend on 
the choice of the geometrical threshold t which defines 
contacts between protein residues (i.e., two amino acid 
residues are said to be in contact if their pairwise dis- 



tance is smaller than f): the larger the threshold t the 
more contacts we count in each q^. To investigate this 
dependence from an information theoretic point of view 
we introduce the Shannon entropy H c for the entire dis- 
tribution of frequencies {<?&}• Our first result is that for 
native protein structures H c is a non-monotonic and par- 
tially convex function of t. Interestingly, a characteris- 
tic minimum of the entropy at t* ~ 5.6A localizes the 
convex region. This alone is remarkable for two reasons. 
First, it readily establishes a structural characteristic of 
proteins, since a local entropy minimum does not appear 
for totally random chains. For example, we show that for 
3D Markovian random walks H c {t) is a monotonically in- 
creasing and linear function. Second, convexity in entropy 
functions for certain statistical systems is known to be an 
indicator for phase transitions 0, However, at this 
point we only have the entropy function H c at hand, rais- 
ing the important question whether we can identify any 
actual system that undergoes a phase transition indicated 
by convex entropy regions. 

To address this question we introduce a mechanical 
model, defined as a Hamiltonian system in which t plays 
the role of a time parameter, that accounts for the func- 
tional behavior of the entropy function. This model fol- 
lows naturally from the Shannon entropy H c and it im- 
plies several intriguing aspects from non-linear Hamilto- 
nian mechanics (instability, broken time reversal symme- 
try, chirality) and from non-equilibrium statistical me- 
chanics (first order phase transitions). In fact, we show 
that this Hamiltonian system can be interpreted as a 
rather simple classical mechanical system of many non- 
linearly coupled constituents (which we call mechanical 
particles) moving in one dimension. We demonstrate that 
the Hamilton function of our system is a sum of a non- 
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linear kinetic term K and a hypothetical potential en- 
ergy term U, which means that we do not further specify 
the functional form of U. This, however, does not invali- 
date the arguments given in this work because the relevant 
points of our analysis solely depend on the well defined ki- 
netic term K. Using this formulation, we can regard con- 
vex regions in the entropy function as locations of phase 
transitions where the mechanical system becomes inhomo- 
geneous as several of its particles coalesce into fragments 
when a certain value range of the geometrical parameter 
t has been reached. We argue that this transition and 
its concurrent fragmentation mainly reflect the formation 
of protein secondary structure when going from small to 
larger values of t. Hereby, we also show that there occur 
two adjacent first-order phase transitions. They deter- 
mine a special value of the geometrical threshold parame- 
ter which turns out to be consistent with previous studies 
on contact graphs Contact graphs defined upon the 

choice of this numerical value are known to carry the en- 
tire structural information 0. Until now, its choice was 
motivated mainly by heuristic arguments and our analysis 
provides physical arguments instead. 

Transitions of the kind reported here are sometimes 
called (multi-)fragmentation transitions, and so far this 
phenomenon has been discussed for physical systems far 
from the thermodynamic limit such as atomic nuclei, 
mesoscopic systems and self-gravitating objects 0,0, 
With the present work, a fragmentation transition is pro- 
posed in the protein context for the first time. In this 
manner we develop a framework where the formation of 
protein structure can be studied as a phase transition of 
a classical mechanical model that literally evolves in the 
threshold parameter t. This is a new alternative to the tra- 
ditional view on proteins as multi-particle quantum phys- 
ical systems that evolve in physical time. Thus we suggest 
a new statistical and mechanical representation of protein 
structures. 

To substantiate the significance of these results, we ar- 
gue that our proposed representation of proteins is a spe- 
cific candidate for the protein design alphabet 10] . In 
doing so, we discover some striking similarities between 
the usage of amino acid types in protein sequences and 
the usage of the purely structurally defined family of cy- 
cle lengths k in proteins. For instance, we show that at 
most twenty different cycle lengths are necessary to en- 
code the structure of a contact graph when defined upon 
the distinguished minimum value of the contact threshold, 
t = t*. Using the same threshold value, we further reveal a 
strong linear correlation between protein sequence length 
and the resulting number of contact graph cycles. Inter- 
estingly, this correlation sharply drops for smaller values 
of the geometrical contact threshold making the minimum 
of the entropy H c (t) peculiar again. These similarities es- 
tablish a relation between protein sequence and structure 
and they strongly support evidence that amino acid se- 
quences contain structural information encoded in lengths 
of contact graph cycles. 

This paper is organized as follows. In the next section 



we define the statistical observables in question and intro- 
duce their frequency distribution as well as the associated 
Shannon entropy H c . In Section ITTT1 we describe and dis- 
cuss the functional behavior of H c (t) by considering an 
ensemble of many native protein structures obtained from 
the DALI/FSSP database. Here, we also elaborate the 
idea of looking at our representation as a protein design 
alphabet, and discuss implications to the protein design 
problem. Section IIVI introduces the mechanical model- 
in terms of a Hamiltonian system- for the cycle frequen- 
cies qk- In this section several properties and implica- 
tions of our model formulation are analyzed. They include 
instability, broken ^-symmetry (Section II V Af) . chirality 
(Section IIV Bl) . and first-order phase transitions (Section 
IIV CJI . We close with section where we draw several 
conclusions and briefly outline possible future directions 
of our work. 

II. CONTACT GRAPHS, ORIENTED CYCLES 
AND THEIR DISTRIBUTION 

Our model observables are frequencies of oriented con- 
tact graph cycles and they form a discrete probability dis- 
tribution when taking an ensemble average over a sta- 
tistical ensemble V of many protein structures. Every 
ensemble member with an amino acid sequence of inte- 
ger length L is identified by its contact matrix C £ V. 
The protein residues are represented by the ordered set 
{1, . . . ,L} such that the integer 1 is associated with the N 
terminal residue and L is identified with the C terminal. 
The contact matrix is derived from the spatial coordinates 
of the carbon C a atoms taken from the protein's native 
structure. Since each C Q atom uniquely corresponds to 
a single residue of the amino acid chain, one can decide 
whether or not any two arbitrary residues are close to 
each other by using a Euclidean distance threshold t of 
typically around 9 A |(| (In what follows, we will always 
represent t as a dimensionless multiplier of 1A — 10 -10 
m.). A contact matrix is a L x L symmetric matrix with 
zero entries for all residue pairs whose mutual distance is 
larger than t, and with entries equal to one for all remain- 
ing pairs which are called contact pairs. Since in a protein 
sequence every nearest and next to nearest residues are in 
contact, we put C(i,j) = 1 for all i, j £ {1, . . . , L} with 

K-il<2. 

To define these observables, we first ask for the number 
qk of contact pairs for any given sequence distance 
k — 1 = j — i with 2 < fc — 1 < L — 1 and with j > i. We 
obtain this number by evaluating a conditional sum over 
the entries in the contact matrix C, i.e. 

q k = £ C &3)- C 1 ) 

3>i, 
k—l—j—i 

This number can be put readily into a graph theoretic 
context. Since any contact matrix represents a graph G 
of L nodes and with arcs (i,j) between exactly those nodes 
which are in contact, the frequency qk gives the number 
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of closed paths of length k which are followed in their 
first k — 1 steps according to the direction imposed by the 
ordered set of the residues {1,2,..., L}. Thus it is the 
ordering of the protein's amino acid chain going from the 
N to the C terminal that gives rise to a N-C direction all 
closed paths counted by q k - In this way, we discard all 
cycles (i.e., closed paths) in G that have more than one 
step opposite to the N-C direction. 

Next we give a statistical meaning to each by tak- 
ing the ensemble average over all contact matrices in V . 
Let us consider the number L m being the maximum amino 
acid sequence length in V . For each 3 < k < L m we calcu- 
late the relative ensemble frequency q^ = N^ 1 ^2 CeV qk 
with N = Ylk^Cev^' an d thus we obtain a discrete 
probability distribution {qk} = {<?3, ■ ■ ■ , q_L m } (Each q\ is 
henceforth denoted as qk-)- We regard this distribution as 
a statistical distribution among the letters of an alphabet 
consisting of all possible N-C cycle lengths 3 < k < L m . 
It is convenient to introduce the index set valued func- 
tion s(P) such that k e s(V) iff qk > 0, and to denote 
n + = \s(V)\. We can now define the Shannon entropy H c 
for {q k } as 

H c = - Yl %log<?fc. ( 2 ) 

Here, the individual frequencies qk are functions of the 
distance cut-off t, because each contact matrix C € V is 
defined upon its choice. Therefore, n + and H c are both 
functions of t and our next aim is to explore their func- 
tional behavior. 



III. FUNCTIONAL BEHAVIOR OF THE 
ENTROPY H c AND THE PROTEIN DESIGN 
ALPHABET 

To study H c (t) for an ensemble of real proteins, we 
have realized V using the DALI/FSSP database of na- 
tive protein structures |llj (We have considered only the 
first available chain/domain in each database file and the 
maximum domain length in this database determines L m , 
which turns out to be L m — 1058.). Fig. 2] shows the 
resulting entropy H c as a function of the distance thresh- 
old t. We see an onset and an expected increase of H c at 
approximately t « 3. However, the monotonic concave in- 
crease of H c is interrupted at a range 4.8 < t < 5.8, where 
H c decreases between a maximum at t* « 4.8 and a min- 
imum at ss 5.6. The local minimum also marks the 
region where H c is not a concave function of i, but where 
it is instead characterized by convexity. The drop of H c 
to a local minimum at shows that most of N-C oriented 
cycles in V can be effectively encoded with an alphabet 
of only n* = exjp[H c (t*)] letters. This statement follows 
since ?i + (£*) « 670, while the function n + saturates at a 
comparable value n + w 920 for large t. And at local min- 
imum with H c (t t ) w 2.97 we get n* w exp(2.97) = 19.5, 
a much smaller number than the total number of realized 
letters of the cycle alphabet, n + (t„) w 670. 



To further confirm these findings we tested different 
choices of the ensemble V by lowering the maximum pro- 
tein length L rn of chosen protein structures from the 
DALI/FSSP database. All choices show the local max- 
imum and the minimum at the same values of t* and t*. 
However, the heights of the extrema change- in particu- 
lar, Hciti,) as a function of L m is monotonically increasing 
with an asymptotic value of n* 19.5 for values of L m 
larger than ~600. 

For comparison, Fig. ^ also includes the entropy func- 
tion H c (t) but for an ensemble V r of 3D Markovian ran- 
dom walks with the same total number of randomly gener- 
ated chains, i.e. \V r \ = and with the same individual 
chain lengths as given by the DALI/FSSP ensemble of 
structures. Each chain in V r has a fixed step length of 
As = 3.8A which is comparable to the average distance 
between two neighboring C Q atoms in proteins. In con- 
trast to the previous case, the entropy function does not 
show any convex intruder (in fact, it is almost a linear 
function of t within the t-range considered), and so we 
conclude that the local minimum along with the convex 
region to which it corresponds is a characteristic signature 
of native proteins. 

One explanation for the local minimum of H c could be 
that most of the oriented cycles in the distribution {qk} 
correspond to secondary structures forming alpha-helices. 
This is supported by the fact that each 2n turn in an 
alpha-coil typically results in three N-C oriented loops 
of length 3, 4, and 5 (they correspond, respectively, to 
residues that are in contact and which are 2, 3 and 4 
steps apart in the amino acid chain). Since alpha-helices 
have an average period of 3.6 chain steps and because it 
is known that alpha-helices make up a fraction of around 
30% of the total secondary structure found in proteins, a 
large portion of the cycle distribution {qk} should repre- 
sent short loops of length 3, 4, or 5. And we indeed verify 
that the relative frequencies (73, (74, and q$ contribute most 
in {qk}- However, we think that this qualitative expla- 
nation cannot account for the characterstic values of the 
entropy function found at the local minimum, as we find 
that n» w 19.5 is larger than 3 = |{3,4, 5}|, and hence 
a three letter cycle alphabet of lenghts {3,4,5} cannot 
encode the information given in {qk}- 

Alternatively, one sees a connection of the asymptotic 
value of w 19.5 and the twenty amino acids types 
that form the alphabet of all protein sequences — raising 
the possibility that there is a deeper understanding be- 
hind the similarity between the two numbers. More- 
over, the total number of twenty alphabet letters turns 
out to be special even when looking at the distribution 
{qk} itself. Figure |2] shows two examples (for t = 5.0 
and for t — 9.0) of this distribution extracted again from 
the DALI/FSSP database with maximum available chain 
length L m — 1058. The resulting graphs are partitioned 
into three different ranges where each range has a different 
scaling behavior: the larger the cycle length k the steeper 
the descent of the graph. Interestingly, the first range A is 
characterized by a rather shallow decrease in absolute fre- 
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quencies qk- In contrast, range B for k G {23, . . . , ^400} is 
clearly different from A by its steeper slope (mg « —1.7 in 
double- logarithmic scale). And, surprisingly, the bound- 
ary point between range A and range B is located al- 
most exactly after the the first twenty N-C cycle lengths 
{3, . . . , 22}. Since the cycle frequencies in range A have 
by far the highest values within the entire distribution, 
we again have a situation where an effective set of twenty 
letters emerges from our underlying alphabet of cycles. 
If not given by pure chance, these correspondences are 
remarkable because at no point have we included any in- 
formation from the amino acid sequence into our study of 
the cycle length distributions; we recall that the latter are 
based only on the structure of contact graphs. Are these 
correspondences telling us that amino acid sequences and 
cycle lengths in protein contact graphs have something 
fundamental in common? 

To address this question, we argue that contact graph 
cycle lengths are a candidate for the hypothetical protein 
alphabet as discussed recently by Dokholyan [T3|. Ac- 
cording to his observations, less than statistically expected 
amino acid types are used in natural protein structures, 
from which he concludes that amino acids might not be 
the natural code to describe protein structure. One con- 
sequence that he draws from this fact is that a family of 
more than twenty "abstract objects" constitutes this 
natural coding alphabet for structures. Identifying this 
structural alphabet would be an important step forward 
in the protein design problem — which is the difficult task of 
designing a sequence that folds into a targeted polypeptide 
structure. Using our structural representation, our claim 
is that these abstract objects are N-C oriented cycles. 

To support this claim, we first look at the amino acid 
usage given a mean protein domain length obtained by 
Dokholyan [ljj from the DALI/FSSP database (Fig. OJ). 
The resulting graph shows how many different amino acid 
types are used given an average protein domain length; 
it is characterized by a strong increase from 10 to about 
18 amino acid types when going to domain lengths up to 
~ 150 amino acids, followed by a saturation at 20 amino 
acid types for longer domains. Now returning to a rep- 
resentation based on cycle lengths, we observe in Fig. 
an almost identical functional behavior (within the un- 
certainty given by the error bars) for the effective cycle 
number, n*, evaluated in the same range of mean domain 
lengths taken from DALI/FSSP. This surprisingly shows 
that amino acid usage in sequences and cycle length us- 
age in structures with corresponding sequence lengths be- 
have the same way. In a related analysis, shown in Fig. 
0] each sequence length from of DALI/FSSP is plotted 
against its corresponding total number of cycles evaluated 
at t = t*. This graph shows a well exposed linear depen- 
dence between both numbers with a optimum linear slope 
of m k 0.91. Also in Fig. 0] are shown the resulting 
Pearson correlation coefficient r and the linear regression 
slope m as functions of t. We observe that r strongly rises 
from r w 0.1 (uncorrelated state) to r « 1 (strongly cor- 
related state) between t w 4 and t ~ i* , indicating that at 



t f=s i* correlation between sequence length and cycle num- 
ber reaches its absolute maximum (upper left insert in Fig. 
|IJ. At the same time, interestingly, the optimum linear 
slope m approaches a value of one in the neighborhood of 
t w t* (lower right insert in Fig. 0J. It is therefore evident 
that, in a statistical sense, the length of an amino acid 
sequence and the number of corresponding contact graph 
cycles are the same when evaluated at the characteristic 
minimum t* of the entropy function H c (t). In this way we 
have detected two quantitative relations between sequence 
and structure. Both relations suggest that the family of 
all possible cycle lengths k constitutes a protein design 
alphabet, thus directly supporting Dokholyan's original 
conjecture. 

Taking into account these results we want to briefly dis- 
cuss their implication for the protein design problem. Sup- 
pose we want to design a certain protein, or any protein- 
like amino acid polypeptide, and thus we set up its target 
structure by specifying the 3D coordinates of its C a atoms 
which together form the polypeptide's carbon backbone of 
length L. What constraints does this information already 
put on the amino acid sequence S aa that would ideally 
fold into the target? Given the initial data we first may 
derive the structure of the corresponding contact graph 
evaluated at minimum entropy contact threshold t* ~ 5.6. 
This graph concurrently defines a sequence S of N-C ori- 
ented cycles with lengths 3 < k < L or, equivalently, 
an equally long sequence S' of ordered contact pairs of 
residues being 2 < (k — 1) < (L — 1) sequence steps apart. 
Thus S lists the cycle lengths of our target structure, e.g. 
S = {3,4,29,3,...} and S' = {2,3,28,2,...}. At this 
point we know that we can safely reduce the number of 
different cycle lengths in S to a maximum of twenty dif- 
ferent lengths (see Fig. |3J). We stress, though, that we 
do not know a priori which cycle lengths can be discarded 
from the original alphabet. Although we have shown that 
statistically the twenty shortest cycle lengths contribute 
most to the distribution {qk}, this does not imply that all 
of them are non-redundant, i.e. not all of them have to be 
necessarily indispensable to represent the structure. So 
the first difficulty is the identification of the twenty or less 
cycle alphabet letters which are necessary to encode the 
structure of the contact graph. Our results further indi- 
cate that the length of the amino acid sequence, \S aa \ = L, 
and the total number of cycles in S should coincide (see 
Fig. 2}, i.e. \S\ ~ L. Thus we expect S aa to have both the 
same length and the same number of different letters as S, 
which suggests that there exists a one-to-one correspon- 
dence between both sequences. But here we come to a 
second obstacle: we do not know the natural order of the 
code letters in S. In other words, we need to know what 
given position in S corresponds to what position in S aa 
so that translation in the correct order of letters would 
eventually generate the aimed fold. Hence we conclude 
that although we have revealed a connection between the 
targeted structure and the sequence it takes for it to fold, 
protein design still remains a problematic task, of course. 
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IV. MODEL FORMULATION: THE 
NON-CONSERVATIVE HAMILTONIAN FLOW 

A. Equations of motion 

We are now going to introduce a theoretical model 
which follows naturally from the entropy function H c (t), 
and which gives a novel physical description of its func- 
tional behavior. This model is cast in the framework of 
Hamiltonian mechanics in which the distance threshold t 
takes the role of the usual time parameter. As we shall 
see in the following sections, this formulation allows us 
to interpret any convex region in H c (t) (see Fig. as a 
location where a first order phase transition occurs in a 
multi-particle and non-linear Hamiltonian system evolv- 
ing in t. We further show that this evolution is not t- 
symmetric and that the mechanical particles themselves 
carry chirality. 

To reach our goals we initially consider each relative 
frequency q k as a continuous function of t. This is an ap- 
proximation because each corresponding absolute number 
of N-C oriented fc-cycles can increase only step-wise as t 
increases. However, as long as the total number of mea- 
sured cycles remains much larger than its (discontinuous) 
increase due to a small change in t the continuum approx- 
imation should hold. The continuum condition is well 
met for large enough ensembles such as the DALI/FSSP 
database, and our remaining discussion is based upon its 
validity. 

At each local extremum of H c (t) it is 

H c (t) = = -£ft(t)[l+log %(*)], (3) 

k 

with t £ {t*,t*}. Each term of this sum over k may 
therefore be written as 



-q k (l + logg fe ) = Pk 



(4) 



withpfc 6 R, and so condition J2J reads as J2kPk = 0- We 
now take a more general view on Eq. (@J and put forward 
the assumption that it remains valid for all values of t, 
including the extremum points, where n + (t) is non-zero 
at the same time. As a consequence, the right-hand-side 
of the extremum condition ^2 k p k = may become pos- 
itive or negative. In this situation Eq. Q is read as 
a proportionality statement between each "particle veloc- 
ity" q k and the corresponding "momentum" p k , and so we 
may rewrite it as a Hamilton equation for each frequency 
Qk, 



Qk 



d 
dpk 



(K + U) 




(5) 



where K — Y^ fc pl/(2m k ) is the total "kinetic energy". 
The proportionality factor between velocity q k and mo- 
mentum p k is the function 



which we refer to as the "particle mass" of the fcth par- 
ticle, and U = U(qs, . . . ,qL m ;t) is a "potential energy" 
function. With the Hamilton function (K + U) we also 
introduce the Hamilton equation for each momentum p k , 

» = -f (K + tf )= __L (*Y- » (7) 
dq k 2q k \m k J dq k 

Together with the normalization and the positivity con- 
straints for the frequencies, J2 k Qk — Ylk \lk\ — E Eq- © 
and |(7J| constitute our hypothetical Hamiltonian flow in 
the phase space of the L m conjugate pairs (q k ,Pk)- Thus 
each pair (qk,Pk) is understood as a "mechanical particle" 
with mass m k . The system itself is hypothetical because 
we have not yet identified the potential energy function 
U. Note that normalization and positivity really are ad- 
ditional constraints here, because in our case Hamilton's 
equations are non-linear in all conjugate variable pairs. 
We also remark that the Hamiltonian system is inherently 
unstable: considering free dynamics by putting U = re- 
sults in a negative change of the momentum, according 
to Eq. J7J. This behavior marks a clear deviation from 
Newton's first law. 



B. Chirality, statistics, and t-symmetry 

Let us look at a single particle with index k in phase 
space. Hamilton's equations for (q k ,p k ) imply a special 
case for those particles with "positions" q k near the 1/e 
point, because the mass m k vanishes at any q k = 1/e. 
Thus if we demand that each particle velocity q k should be 
finite, then its momentum p k must vanish sufficiently fast 
at the moment of crossing the 1/e point. There are four 
possible situations of how this crossing can occur. Let t c 
be the crossing time point and let us observe the particle's 
behavior in some small neighborhood of t c . In the first 
possibility (type I, see Fig. E}, q k approaches the 1/e 
point from below, that is from the interval o = (0, 1/e), it 
crosses the point once at t = t c , and eventually it returns 
to region o. Concurrently, the particle's momentum p k is 
positive for all t < t c , it vanishes at the crossing point, 
and finally changes its original sign. We introduce the 
particle's chirality Xk(t) defined as 



Xk 



qk Pk 
\4k\ \pk\ 



m k 
\m k \ 



(8) 



m k = -(1 + logq k ) 



(0) 



and call a particle with Xfe = — 1 left-handed, while a par- 
ticle with Xk = +1 w e name right-handed. Clearly, for a 
type I crossing, Xk{t) is positive for all t < t c , and it re- 
mains so for t > t c ; thus velocity q k and momentum p k are 
always parallel and no change in the particle's chiral state 
occurs. The situation is different for the other possible 
crossings. In fact, whenever it is q k > 1/e, the particle's 
mass m k becomes negative and therefore the particle itself 
must be left-handed. Normalization of {q k } requires that 
there can be at most two left-handed particles. In this 
manner we find two crossing types (types III and IV in 
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Fig. [SJ where a change in chirality occurs (type III means 
right-handed at t < i c , and type IV is initially left-handed 
at t < t c ), and two others (type I and type II in Fig. EJ) 
where the chiral state is conserved (type I is right-handed 
for t < t c , while type III is initially left-handed). Thus 
crossings of the l/e line in the particle positions qk can 
change parity. 

Given the above rules we want to derive crossing statis- 
tics for many particles. Let the iV-tuple ■ • ■ , Xn) rep- 
resent the TV-particle chiral state at times t < t c before 
crossing. What is the expected fraction v^\n) of right- 
handed particles after one crossing? This fraction must be 
a function of the particle number N and of the three pos- 
sible initial states % £ {0, 1, 2} respectively corresponding 
to the system initially having zero, one, or two left-handed 
particles. To obtain 2/. (TV) we presume that each cross- 
ing type I- IV is equally probable, and solve the resulting 
combinatorial problem for all N > 3: 



,(i) 



(TV) 
(TV) 



3N 2 -2N + 1 
3JV 2 + iV 

3iV 2 - 5iV + 3 
3N 2 - N 



(9) 



(N) = I-—. 
K ' AN 



All three values monotonically approach 1 as for large N 
the vast majority of the particles must be right-handed. 
On the other hand, for small N we find a decrease of 
the mean fraction reflecting the statistical preference 
of the system being mostly left-handed. For instance, 
at N — 3 the mean fraction becomes pW(iV = 3) = 
jELo^fS) ~ 0.65, for N = 2 we get z/«(2) = 1/4, 
and in the trivial case of N = 1 we have D^{1) = 0. Fur- 
ther studies should allow multiple crossings resulting in a 
calculation of i?( c \N) for c > 1. 

Furthermore, we argue that the Hamiltonian system 
has broken time reversal invariance, i.e. it is a non- 
conservative Hamiltonian system. Inverting the sign of 
the velocities, qk — > — <jk, in Eq. (J5J accounts for a con- 
current change in the sign of the momenta pk, but time 
reversal symmetry cannot be sustained in the momentum 
Eq. |(7J| because the first term on the right hand side is 
negative for all values qk and pk ■ 



C. First-order phase transitions 

The statistical physics of systems with a relatively small 
number of degrees of freedom has been recently inves- 
tigated based on their micro-canonical properties 0, 
These systems often cannot be described properly in a 
thermodynamic limit which is valid for large and locally 
homogeneous systems. Therefore they are referred to as 
small [1,^2 or as finite 0] statistical systems. Known ex- 
amples of small systems are self-gravitating objects (e.g., 
stars) and atomic nuclei , and our aim is to describe the 
behavior of our mechanical system within this framework. 



The entropy function is an indicator for phase transi- 
tions in small or finite statistical systems (3, H, El ■ Phase 
transitions are then characterized in terms of the topology 
of the entropy function: concavity indicates a pure phase 
while convexity reflects a transitory behavior where sev- 
eral different phases coexist. If the entropy S is a function 
of several observables {xi, . . . , x n } then the largest eigen- 
value Ai of the curvature matrix with entries d Xl d Xl S with 
(m,l) £ {!,..., n} 2 determines the order of the phase 
transition: for Ai > the transition is of first order while 
second order phase transitions prerequisite a vanishing Ai . 
The eigenvector corresponding to the largest eigenvalue 
constitutes the local order parameter. In our case, the 
entropy function is the Shannon entropy H c and the ob- 
servables are the absolute numbers of different N-C loop 
lengths for a given ensemble V . Since there is a one-to- 
one correspondence between the threshold values t and the 
absolute numbers of the loop lengths within an statisti- 
cal ensemble V (they are strictly monotonically increasing 
functions of t), it is sufficient to consider t as the only in- 
dependent observable in H c . Hence convex regions with 
positive second derivative of H c (t) indicate a first order 
phase transition, and the local order parameter is 



alt 2 



(10) 



There is a convex region of the entropy function around 
= 5.6 (Fig. 2]) separating two concave regions at 
3 ^ t < 5.4 and at t > 6. Therefore a first order 
phase transition occurs at £*. Interestingly, we see a sec- 
ond convex region of the entropy function in the interval 
6-5 < t < 9 being much less prominent than the first at 
5.4 < t < 5.8. This observation implies that there are two 
first order phase transitions occurring close to each other. 

First order phase transitions in small systems signal 
the fragmentation of the system constituents into clusters, 
and thus indicate a preference of the system to become in- 
homogeneous and to separate into different phases. From 
this point of view we can interpret the behavior of H c (t) 
as a transition from a random phase to an ordered phase 
where the formation of secondary protein structure, espe- 
cially through alpha-helices, leads to fragmentation into 
clusters of those cycle lengths that are associated with its 
formation. Intuitively, this process can be understood as 
collapse into protein (secondary) structure in the abstract 
space of cycle length frequencies. From this point of view, 
we may argue that after the two phase transitions are 
completed protein structure — when expressed in terms of 
cycle lengths — is fully obtained. The least geometrical 
threshold for which this situation occurs is at about t w 9 
(see Fig. QJ, because it right there where the convex re- 
gion in H c (t) ends. Interestingly, in previous studies on 
protein contact maps this value turned out to be special. 
For example, in [(| it is argued that a choice of t 9 is 
determined by the requirement that the average number 
of C a -C a contacts for each amino acid is roughly equal 
to the respective numbers obtained with the all-atom def- 
inition of contacts. In other words, for this choice the 
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statistical properties of contact graphs do not depend on 
the details of the definition of a contact pair. This argu- 
ment makes well sense in our setting because it is known 
from statistical physics that near a critical point, i.e. near 
the point of the phase transition, many system properties 
do not depend on a detailed description of the system it- 
self. Also in |(J this particular value of t was used to show 
that a protein 3D structure can be recovered completely 
from the knowledge of the contact map. This study con- 
firms our previous argument saying that at this value of 
the threshold parameter the structure is encoded in the 
contact graph. 

Beside convex regions in the entropy function, phase 
transitions in small systems are known to be accompa- 
nied by large fluctuations in kinetic energy 0, 0] , and 
we want to verify this fact in our approach. Using again 
the DALI/FSSP database, Fig. shows the total kinetic 
energy K(t). We observe strong fluctuations of K for t- 
values smaller than t « 5.5, that is right before the first 
order phase transition located at £* « 5.6 (see Fig. QJ. 
These fluctuations sharply drop for slightly larger values 
of t. There is a second peak of smaller amplitude in K 
at t ks 6 indicating the onset of a second first order phase 
transition at the second convex region in H c (t). The iden- 
tification of the emerged system fragments is done by se- 
lecting those groups of particles which show the largest 
fluctuations in kinetic energy; in agreement with our ear- 
lier arguments we find in both transitions the main frag- 
ment consisting of the particles labeled with 3, 4, and 5. 
The observed fluctuations in kinetic energy K become an 
important validation of our proposed theoretical model 
because they show that the constructed Hamilton func- 
tion (K + U) indeed is a natural choice to be an indicator 
for the transitory behavior between phases. That is to say, 
negatively, that an unnatural choice of K would not be a 
specific indicator for the phase transitions considered. 

As the last point of our model analysis we want 
to demonstrate that the mechanical system is one- 
dimensional. In 01 it is argued that the depth As of a 
convex intruder in the entropy per particle s — S/N scales 
with the number of surface particles at the spatial phase 
boundaries, thus it is As ~ N~ x / d , with d being the spa- 
tial dimensionality. For our chosen ensemble V we have 
calculated the entropy depth of the convex intruder AH C 
in the range of maximum chain length 100 < L m < 1058 
(see also Fig. ^for the construction of AH C ). Within 
this range we have not found a significant change of AH C , 
and we therefore obtain As ~ (n + ) . Since the num- 
ber of surface particles is independent of the total number 
of particles only in systems with dimensionality one, we 
conclude that d = 1 for the finite system of particles rep- 
resented by the Hamilton equations J^J and JJJ. Our di- 
mensionality argument signals that the system essentially 
behaves like multi-particle system in one dimension. By 
this term we mean that we cannot arbitrary remove parti- 
cles and concurrently expect that the transitory behavior 
between phases remains. For example, by removing the 
particles labeled with k e {3,4,5} the system would not 



exhibit a phase transition at t* anymore, and consequently 
As would cease to have a physical meaning. But as long as 
this phase transition is preserved, the corresponding depth 
of the convex intruder is essentially independent from the 
particle number — thus suggesting one-dimensionality. 

V. CONCLUSIONS AND OUTLOOK 

We have described several unsuspected features of pro- 
tein folds by introducing a representation based on simple 
statistical observables. They are defined as frequencies of 
cycles of different integer length k in contact graphs, and 
we have shown that they encode a characteristic statis- 
tical signature when expressed by their Shannon entropy 
H c . This signature consists of convex regions in the en- 
tropy when considered as a function of the geometrical 
contact threshold t. Based on this representation, we have 
proposed a classical Hamiltonian system as a mechanical 
model for the observed behavior of the entropy function 
H c (t). This model shows several striking features includ- 
ing chirality, broken time reversal symmetry, first order 
phase transitions in the context of small statistical sys- 
tems, and one-dimensionality. Concurrently, it allows to 
see convex regions in the entropy function as regions where 
fragmentation of the mechanical system constituents oc- 
curs in a first-order phase transition reflecting the gener- 
ation of protein structure. This transition- also known as 
(multi-)fragmentation transition in other statistical sys- 
tems far from a thermodynamic limit- has for the first 
time been directly studied for protein structures. 

In a broader sense, our approach offers an extended 
view on protein structures. Although one is still free to 
consider them as analytically deduced- that is, derived 
from an a priori given ensemble V of real protein data 
such as the Protein Data Bank- there is now an alterna- 
tive mode of description in which those observables, and 
within them the encoded information about protein struc- 
ture, are obtained synthetically through the process of ac- 
tually solving the Hamiltonian system once its potential 
energy and the initial conditions have been properly iden- 
tified. Of course, such an synthetic approach would exist 
concurrently to known theoretical models and computa- 
tional algorithms that describe the behavior of protein 
folds. It is worthwhile to stress, though, that our pro- 
posed mechanical system has significantly fewer degrees 
of freedom than does a (quantum) physical set up neces- 
sary to describe protein dynamics in usual 3D space, and 
hence its direct computational study would require much 
lower complexity. Thus we think that forthcoming studies 
of the proposed mechanical system should look to identify 
the potential energy function in order to equip our model 
with true predictive power. 

In the context of the protein design problem, this rep- 
resentation of protein structures is a specific realization of 
the protein design alphabet. This proposition has found 
strong support in our results showing that in native pro- 
tein structures the amino acid usage and the usage of the 
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cycle length alphabet — the latter defined entirely upon 
structure — show striking quantitative similarities. We 
therefore conclude that amino acid sequences directly en- 
code structural information when represented as lengths 
of oriented cycles in protein contact graphs. 

Since in this work we consider only statistical distribu- 
tions of cycle lengths derived from entire protein folds, 
we have no specific information at the level of individual 
amino acids, and therefore we cannot make any predic- 
tions about functionally relevant protein residues, for ex- 
ample. However, it has been shown recently that contact 
frequencies are not uniformly distributed along a protein's 
polypeptide chain and that residues with high closeness 
values, i.e. those residues which are neighbors to many 
other residues in the graph topology, do correlate with 
functional and evolutionary important sites jlfij . And in 
[T?| the concept of global protein design has been em- 
ployed for individual residues assessing a site-specific dis- 
tribution of the amino acid alphabet constrained by local 
structural features. We therefore believe that our results 
may be reasonably applied on a local, i.e. residue specific, 
level as well resulting in a study of distributions {q^ } at 
a chosen residue site i. Such future applications would 
bring us closer to established methods such as the evo- 



lutionary trace that looks for statistically significant cor- 
relations between sequence, structure and function based 
on evolutionary variation [13, [T]j . 

We should finally note that there is another example of 
a quantitative measure being apparently related to the 
cycle frequencies qu introduced here. It is referred to 
as the contact order (CO) originally introduced in pol ] 
as CO — l/{nL)J2j>id(i,j), where n is the number of 
contacts in a protein and d(i,j) is the sequence distance 
function between residues i and j. This measure has been 
mainly studied in the context of the protein folding pro- 
cess. On the contrary, our analysis focuses on already 
folded native protein structures and although sequence 
separation between residues is used in our definition of qk 
as well, CO is a different mathematical construction from 
any of those we have established in this work. It is there- 
fore fair to say that previous studies based on the contact 
order are not directly related to our results. 

The authors thank I. Mihalck and I. Res for their helpful 
comments. Financial support from the grants NIH R01- 
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Figure 1 (Lisewski/Lichtarge) 

FIG. 1: The large graph shows the entropy H c as a function 
of t (measured in A) derived from the DALI/FSSP database 
of protein structures. For comparison, points labeled with '+' 
give the entropy for domains created by Markovian 3D walks 
with step width As = 3.8A. The insert shows the number 
n + of observables qu with qu > for all k 6 {3, . . . ,L m } in 
the same t- interval as H c . The two small arrows pointing up 
depict centers of convex regions in H c . AH C indicates the 
depth of the first convex region estimated as the maximum 
vertical distance between the double tangent (the straight line 
above) and the local convex region of H c . The double tangent 
gives the concave hull of H c (t) (Gibbs construction). 
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Figure 2 (Lisewski/Lichtarge) 

FIG. 2: The distribution of N-C cycle frequencies ex- 

tracted from the DALI/FSSP database with maximum avail- 
able chain length L m — 1058. Both curves show three dif- 
ferent scaling ranges, i.e. ranges with different slopes. Fhe 
first vertical line at k = 23 indicates the boundary of the first 
twenty cycle lengths k £ {3, . . . , 22}; this line intersects with 
the boundary ("knee") between the first (range A) and the 
second scaling range (range B) The linear slope mg at range 
B is for both graphs tub ~ —1-7, while for range C we have 
mc ~ —7.0. Range A shows a clear deviation from in linearity 
in double logarithmic scale. 
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Figure 3 (Lisewski/Lichtarge) 

FIG. 3: Comparison of the amino acid usage (graph depicted 
with '+' and taken from [TU p with the effective number n* 
of oriented cycles (graph depicted with 'o') within a range of 
mean domain lengths of structures taken from DALI/FSSP. 
Standard deviations in domain length are included in both 
graphs. 
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Figure 4 (Lisewski/Lichtarge) 



FIG. 4: Scatter plot of DALI/FSSP domain sequence lengths 
and their corresponding numbers of estimated oriented contact 
graph cycles evaluated at a minimum threshold value t* = 
5.6A. Linear regression gives a slope of 0.91 and an offset 
of —9.5. Lhe upper-left insert shows the Pearson correlation 
coefficient as a function of t, while the lower-right insert does 
the same for the slop m of the resulting dataset given by linear 
regression. 




Figure 5 (Liscwski/Lichtarge) 

FIG. 5: Schematic illustration of the four crossing types of the 
qk = 1/e point in phase space of (qk,Pk)- Fype I and II induce 
a change in chirality Xfc while III and IV do not. 
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Figure 6 (Lisewski/Lichtarge) 

FIG. 6: The total kinetic energy K as function of the threshold 
t realized over two choices (L m ~ 300 and L m = 600) of statis- 
tical ensembles. Strong fluctuations in K are the precursors of 
two first order phase transitions identified by convex regions in 
H c (t), see Fig. and the two horizontal arrows above assign 
those two convex regions. 



